library(dslabs)
installed.packages()
library(dslabs)
data(Teams)
Warning message: In data(Teams) : data set ‘Teams’ not found
data(teams)
Warning message: In data(teams) : data set ‘teams’ not found
data(murders)
population
Error: object ‘population’ not found
murders$population
pop <- murders$population
length(pop)
[1] 51
class(pop)
[1] "numeric"
class(murders$state)
[1] "character"
library(Lahman)
Error in library(Lahman) : there is no package called ‘Lahman’
install.packages("Lahman")
install.packages("tidyverse")
library(Lahman)
library(tidyverse)
Teams %>% filter(yearID %in% 1961:2001) %>%
  mutate(HR_per_game = HR/G, R_per_game = R/G) %>%
  lm(R_per_game ~ BB, .)
Call: lm(formula = R_per_game ~ BB, data = .)
Coefficients:
(Intercept)           BB
   2.582186     0.003396
Teams %>% filter(yearID %in% 1961:2001) %>%
  mutate(AB_per_game = AB/G, R_per_game = R/G) %>%
  ggplot(aes(R_per_game, AB_per_game)) +
  geom_point()
Teams %>% filter(yearID %in% 1961:2001) %>%
  mutate(AB_per_game = AB/G, R_per_game = R/G) %>%
  ggplot(aes(AB_per_game, R_per_game)) +
  geom_line()
Teams %>% filter(yearID %in% 1961:2001) %>%
  mutate(AB_per_game = AB/G, R_per_game = R/G) %>%
  ggplot(aes(AB_per_game, R_per_game)) +
  geom_point(alpha = 0.5)
Teams %>% filter(yearID %in% 1961:2001) %>%
  ggplot(aes(AB, R)) +
  geom_point(alpha = 0.5)
Teams %>% filter(yearID %in% 1961:2001) %>%
  mutate(AB_per_game = AB/G, R_per_game = R/G) %>%
  ggplot(aes(AB_per_game, R_per_game)) +
  geom_point(alpha = 0.5)
?Teams
Teams %>% filter(yearID %in% 1961:2001) %>%
  mutate(AB_per_game = AB/G, R_per_game = R/G) %>%
  ggplot(aes(AB_per_game, R_per_game)) +
  geom_point(alpha = 0.5)
Teams %>% filter(yearID %in% 1961:2001) %>%
  mutate(number_of_wins_per_game = W/G, fielding_errors_per_game = E/G) %>%
  ggplot(aes(number_of_wins_per_game, fielding_errors_per_game)) +
  geom_point(alpha = 0.5)
Teams %>% filter(yearID %in% 1961:2001) %>%
  mutate(win_rate = W / G, E_per_game = E / G) %>%
  ggplot(aes(win_rate, E_per_game)) +
  geom_point(alpha = 0.5)
Teams %>% filter(yearID %in% 1961:2001) %>%
  mutate(triple_per_game = X3B/G, double_per_game = X2B/G) %>%
  ggplot(aes(triple_per_game, double_per_game)) +
  geom_point(alpha = 0.5)
As motivation for this course, we’ll go back to 2002 and try to build a baseball team with a limited budget. Note that in 2002, the Yankees’ payroll of almost $130 million was more than triple the Oakland A’s $40 million budget. [][Statistics have been used in baseball since its beginnings]. Note that the data set we will be using, included in the Lahman library, goes back to the 19th century. For example, a summary statistic we will describe soon, the batting average, has been used to summarize a batter’s success for decades. Other statistics such as home runs, runs batted in, and stolen bases (we’ll describe all of these soon) are reported for each player in the game summaries included in the sports section of newspapers.
And players are rewarded for high numbers. Although summary statistics were widely used in baseball, data analysis per se was not. These statistics were arbitrarily decided on without much thought as to whether they actually predicted, or were related to helping a team win. This all changed with Bill James. In the late 1970s, this aspiring writer and baseball fan started publishing articles describing more in-depth analysis of baseball data. He named the approach of using data to predict what outcomes best predict if a team wins [][sabermetrics]. Until Billy Beane made sabermetrics the center of his baseball operations, Bill James’ work was mostly ignored by the baseball world.
Today, pretty much every team uses the approach, and it has gone beyond baseball into other sports. In this course, to simplify the example we use, we’ll focus on predicting scoring runs. We will ignore pitching and fielding, although those are important as well. We will see how regression analysis can help develop strategies to build a competitive baseball team with a constrained budget. [][The approach can be divided into two separate data analyses. In the first, we determine which recorded player specific statistics predict runs. In the second, we examine if players were undervalued based on what our first analysis predicts.]
[][Textbook link]
The corresponding section of the textbook is the case study on Moneyball. https://rafalab.github.io/dsbook/linear-models.html#case-study-moneyball
[][Key point]
Bill James was the originator of sabermetrics, the approach of using data to predict what outcomes best predicted if a team would win.
We actually don’t need to understand all the details about the game
of baseball, which has over 100 rules, to see how regression will help
us find undervalued players. Here, we distill the sport to the basic
knowledge one needs to know to effectively attack the data science
challenge. Let’s get started. [][The goal of a baseball game is to score
more runs, they’re like points, than the other team]. Each team has
nine batters that bat in a predetermined order. After the ninth
batter hits, we start with the first again. Each time they come to bat,
we call it a plate appearance, PA. At each plate appearance,
the other team’s pitcher throws the ball and you try to hit it.
The plate appearance ends with a binary outcome–you either make an
out, that’s a failure and sit back down, or you don’t, that’s a
success and you get to run around the bases and potentially score a
run
(So does a run mean the batter goes from home base to 2nd base or something???).
Each team gets nine tries, referred to as innings, to score runs. Each
inning ends after three outs, after you’ve failed three times.
From these examples, we see how luck is involved in the process. When
you bat you want to hit the ball hard. If you hit it hard enough, it’s a
home run, the best possible outcome as you get at least one automatic
run. But sometimes, due to chance, you hit the ball very hard and a
defender (who is this defender???) catches it, which makes
it an out, a failure. In contrast, sometimes you hit the ball softly but
it lands just in the right place. You get a hit which is a success. The
fact that there is chance involved hints at why probability models will
be involved in all this. Now there are [][several ways to succeed].
Understanding this distinction will be important for our
analysis.
[][When you hit the ball you want to pass as many bases as possible].
There are four bases with the fourth one called home plate. Home plate
is where you start, where you try to hit. So the bases form a cycle.
[][If you get home, you score a run]
(Does this mean you pass all 4 bases one by one during a single turn at bat???).
We’re simplifying a bit. But there are five ways you can
succeed. In other words, not making an out. First one is called
a base on balls. [][This is when the pitcher does not
pitch well and you get to go to first base]. A single is when you hit
the ball and you get to first base. A double is when you hit the ball
and you go past first base to second. Triple is when
you do that but get to third. And [][a home run is when you hit the ball
and go all the way home and score a run]. [][If you get to a base, you
still have a chance of getting home and scoring a run if the next batter
hits successfully]. While you are on base, you can also try to [][steal
a base]. If you run fast enough, you can try to go from first to second
or from second to third without the other team tagging you.
All right. Now historically, the batting average has been considered the most important offensive statistic. To define this average, we define a [][hit] and an [][at bat]. Singles, doubles, triples, and home runs are [][hits]. But remember, there’s a fifth way to be successful, the base on balls. That is not a hit. [][An at bat] is any time you either get a hit or make an out; bases on balls are excluded. The batting average is simply hits divided by at bats, and it is considered the main measure of a success rate. In today’s game, this success rate ranges from player to player from about 20% to 38%. We refer to the batting average in thousands. So for example, if your success rate is 25%, we say you’re batting 250.
One of Bill James’ first important insights is that the [][batting average ignores bases on balls but a base on balls is a success]. So a player that gets many more bases on balls than the average player might not be recognized if he does not excel in batting average. But is this player not helping produce runs? No award is given to the player with the most bases on balls. In contrast, the total number of stolen bases is considered important and an award is given out to the player with the most. But players with high totals of stolen bases also make outs, as they do not always succeed.
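To make the definitions above concrete, here is a minimal sketch in base R. The season totals are hypothetical numbers chosen for easy arithmetic, not real player data, and the variable names are my own. It computes the batting average and, for contrast, a success rate that also counts bases on balls.

```r
# Hypothetical season totals for one player (made-up numbers).
singles   <- 100
doubles   <- 30
triples   <- 5
home_runs <- 15
outs      <- 350
walks     <- 60   # bases on balls (BB)

# Hits are singles, doubles, triples and home runs; walks are not hits.
hits <- singles + doubles + triples + home_runs

# An at bat ends in either a hit or an out; walks are excluded.
at_bats <- hits + outs

# Batting average = H / AB.
batting_average <- hits / at_bats

# A success rate that also counts walks, over all plate appearances.
plate_appearances <- at_bats + walks
success_rate <- (hits + walks) / plate_appearances

batting_average  # 0.3, read as "batting 300"
success_rate     # 0.375
```

The gap between the two numbers is the contribution the batting average ignores: this hypothetical player avoids an out noticeably more often than his batting average suggests.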
So does a player with a high stolen base total help produce runs? Can
we use data science to determine if it’s better to pay for bases on balls
or stolen bases? [][One of the challenges in this analysis is that it is
not obvious how to determine if a player produces runs because so much
depends on his teammates]. We do keep track of the number of runs scored
by our player. But note that if you hit after someone who hits
many home runs, you will score many runs
(Super batter hit the ball far away thus you can run many bases as well, lucky player).
But these runs don’t necessarily happen if we hire this player but not
his home run hitting teammate. [][However, we can examine team level
statistics] (How ???). How do teams with many stolen bases
compare to teams with few? How about bases on balls? We have data. Let’s
examine some.
[][Textbook link]
This video corresponds to the textbook section on baseball basics. https://rafalab.github.io/dsbook/linear-models.html#baseball-basics
[][Key points]
The goal of a baseball game is to score more runs (points) than the other team.
Each team has 9 batters who have an opportunity to hit a ball with a bat in a predetermined order.
Each time a batter has an opportunity to bat, we call it a plate appearance (PA).
The PA ends with a binary outcome: the batter either makes an out (failure) and returns to the bench or the batter doesn’t (success) and can run around the bases, and potentially score a run (reach all 4 bases).
We are simplifying a bit, but there are five ways a batter can succeed (not make an out):
Base on balls (BB): the pitcher fails to throw the ball through a predefined area considered to be hittable (the strike zone), so the batter is permitted to go to first base.
Single: the batter hits the ball and gets to first base.
Double (2B): the batter hits the ball and gets to second base.
Triple (3B): the batter hits the ball and gets to third base.
Home Run (HR): the batter hits the ball and goes all the way home and scores a run.
Historically, the batting average has been considered the most important offensive statistic. To define this average, we define a hit (H) and an at bat (AB). Singles, doubles, triples, and home runs are hits. The fifth way to be successful, a walk (BB), is not a hit. An AB is the number of times you either get a hit or make an out; BBs are excluded. The batting average is simply H/AB and is considered the main measure of a success rate.
Note: The video states that if you hit AFTER someone who hits many home runs, you will score many runs, while the textbook states that if you hit BEFORE someone who hits many home runs, you will score many runs. The textbook wording is accurate.
plate appearance
In baseball, a home run (abbreviated HR) is scored when the ball is hit in such a way that the batter is able to circle the bases and reach home safely
Base on balls
A single is when you hit the ball and get to first base
baseball home run, go all the way home and score a run
baseball steal a base
batting average equation
Let’s start looking at some baseball data and try to answer your questions using these data. First one, do teams that hit more home runs score more runs? We know what the answer to this will be, but let’s look at the data anyways. We’re going to examine data from 1961 to 2001. We end at 2001 because, remember, we’re back in 2002, getting ready to build a team.
We started in 1961, because that year, the league changed from 154
games to 162 games. The visualization of choice when exploring the
relationship between two variables like home runs and runs is a
scatterplot (So we can do what we prefer I suppose).
The following code shows you how to make that scatterplot. We start by
loading the Lahman library that has all these baseball statistics. And
then we simply make a scatterplot using ggplot. Here’s a plot of runs
per game versus home runs per game.
The plot shows a very strong association: teams with more home runs tended to score more runs. Now, let’s examine the relationship between stolen bases and runs. Here are the runs per game plotted against stolen bases per game. Here, the relationship is not as clear. Finally, let’s examine the relationship between bases on balls and runs. Here are runs per game versus bases on balls per game. Although the relationship is not as strong as it was for home runs, we do see a pretty strong relationship here.
We know that, by definition, home runs cause runs, because when you
hit a home run, at least one run will score. [][Need to study how
baseball scoring works: are runs and scores the same thing in
a baseball game ???] Now it could be that home runs also cause the bases
on balls (How ??? base on balls. This is when the
pitcher does not pitch well and you get to go to first base).
If you understand the game, you will agree with me that that could be
the case. [][So it might appear that bases on balls are causing runs,
when in fact it’s home runs that are causing both]. This is called
[][confounding], an important concept you will learn about. Linear
regression will help us parse all this out and quantify the
associations. This will then help us determine what players to recruit.
Specifically, we will try to predict things like how many more runs will
the team score if we increase the number of bases on balls but keep the
home runs fixed. Regression will help us answer this question, as
well.
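The idea of "increasing bases on balls while keeping home runs fixed" can be sketched with simulated data in base R. Everything below is an assumption made for illustration: the variables `hr`, `bb` and `runs` are standardized, made-up quantities (not the Lahman data), and the effect sizes 0.5, 1.5 and 0.3 are arbitrary. The point is only that a confounder inflates the slope when it is left out of the model.

```r
set.seed(1)
n <- 1000

# Simulated, standardized "team" statistics (hypothetical, not real data).
hr   <- rnorm(n)                         # home runs: the confounder
bb   <- 0.5 * hr + rnorm(n)              # bases on balls partly driven by HR
runs <- 1.5 * hr + 0.3 * bb + rnorm(n)   # runs depend on both

# Ignoring home runs: the BB slope absorbs part of the HR effect.
fit_bb_only <- lm(runs ~ bb)

# Holding home runs fixed: the BB slope is close to the simulated 0.3.
fit_adjusted <- lm(runs ~ bb + hr)

slope_unadjusted <- unname(coef(fit_bb_only)["bb"])
slope_adjusted   <- unname(coef(fit_adjusted)["bb"])

slope_unadjusted
slope_adjusted
```

In this simulation the unadjusted slope is much larger than the adjusted one, because teams with more home runs also tend to have more bases on balls. Whether the same happens in the real baseball data is exactly what the regression analysis is meant to find out.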
[][Textbook link]
This video corresponds to the base on balls or stolen bases textbook section. https://rafalab.github.io/dsbook/linear-models.html#base-on-balls-or-stolen-bases
[][Key points]
The visualization of choice when exploring the relationship between two variables like home runs and runs is a scatterplot.
Code: Scatterplot of the relationship between HRs and runs
library(Lahman)
library(tidyverse)
library(dslabs)
ds_theme_set()
Teams %>% filter(yearID %in% 1961:2001) %>%
  mutate(HR_per_game = HR / G, R_per_game = R / G) %>%
  ggplot(aes(HR_per_game, R_per_game)) +
  geom_point(alpha = 0.5)
Code: Scatterplot of the relationship between stolen bases and runs
Teams %>% filter(yearID %in% 1961:2001) %>%
  mutate(SB_per_game = SB / G, R_per_game = R / G) %>%
  ggplot(aes(SB_per_game, R_per_game)) +
  geom_point(alpha = 0.5)
Code: Scatterplot of the relationship between bases on balls and runs
Teams %>% filter(yearID %in% 1961:2001) %>%
  mutate(BB_per_game = BB / G, R_per_game = R / G) %>%
  ggplot(aes(BB_per_game, R_per_game)) +
  geom_point(alpha = 0.5)
library(Lahman)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
# dslabs::ds_theme_set()  # optional plot theme; ds_theme_set() comes from dslabs, not ggplot2
Teams %>%
filter(yearID %in% 1961:2001) %>%
mutate(HR_per_game = HR/G, R_per_game = R/G) %>%
ggplot2::ggplot(aes(HR_per_game, R_per_game)) +
geom_point(alpha=0.5)
caused the both
1/1 point (graded) What is the application of statistics and data science to baseball called?
Moneyball
Sabermetrics
The “Oakland A’s Approach”
There is no specific name for this; it’s just data science.
1/1 point (graded) Which of the following outcomes is not included in the batting average?
A home run
A base on balls
An out
A single
1/1 point (graded) Why do we consider team statistics as well as individual player statistics?
The success of any individual player also depends on the strength of their team.
Team statistics can be easier to calculate.
The ultimate goal of sabermetrics is to rank teams, not players.
1.0/1.0 point (graded) You want to know whether teams with more at-bats per game have more runs per game. What R code below correctly makes a scatter plot for this relationship?
Teams %>% filter(yearID %in% 1961:2001 ) %>% ggplot(aes(AB, R)) + geom_point(alpha = 0.5)
Teams %>% filter(yearID %in% 1961:2001 ) %>% mutate(AB_per_game = AB/G, R_per_game = R/G) %>% ggplot(aes(AB_per_game, R_per_game)) + geom_point(alpha = 0.5)
Teams %>% filter(yearID %in% 1961:2001 ) %>% mutate(AB_per_game = AB/G, R_per_game = R/G) %>% ggplot(aes(AB_per_game, R_per_game)) + geom_line()
Teams %>% filter(yearID %in% 1961:2001 ) %>% mutate(AB_per_game = AB/G, R_per_game = R/G) %>% ggplot(aes(R_per_game, AB_per_game)) + geom_point()
1.0/1.0 point (graded) What does the variable “SOA” stand for in the Teams table?
Hint: make sure to use the help file (?Teams).
sacrifice out
slides or attempts
strikeouts by pitchers
accumulated singles
1/1 point (graded)
Load the Lahman library. Filter the Teams data frame to include years from 1961 to 2001. Make a scatterplot of runs per game versus at bats (AB) per game. Which of the following is true?
There is no clear relationship between runs and at bats per game.
As the number of at bats per game increases, the number of runs per game tends to increase.
As the number of at bats per game increases, the number of runs per game tends to decrease.
Teams %>% filter(yearID %in% 1961:2001) %>%
  mutate(AB_per_game = AB/G, R_per_game = R/G) %>%
  ggplot(aes(AB_per_game, R_per_game)) +
  geom_point(alpha = 0.5)
0/1 point (graded)
Use the filtered Teams data frame from Question 6. Make a scatterplot of win rate (number of wins per game) versus number of fielding errors (E) per game. Which of the following is true?
There is no relationship between win rate and errors per game.
As the number of errors per game increases, the win rate tends to increase.
As the number of errors per game increases, the win rate tends to decrease. (This is the answer.)
library(dplyr) # this is for pipe %>%
library(ggplot2)
library(Lahman)
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v tibble 3.1.6 v purrr 0.3.4
## v tidyr 1.2.0 v stringr 1.4.0
## v readr 2.1.2 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
Teams %>% filter(yearID %in% 1961:2001 ) %>%
mutate(number_of_wins_per_game = W/G, fielding_errors_per_game = E/G) %>%
ggplot(aes(number_of_wins_per_game, fielding_errors_per_game)) + geom_point(alpha = 0.3)
# Explanation
When you examine the scatterplot, you can see a clear trend towards
decreased win rate with increasing number of errors per game
(before, I was using a big scatter marker size). The following
code can be used to make the scatterplot:
Teams %>% filter(yearID %in% 1961:2001) %>%
  mutate(win_rate = W / G, E_per_game = E / G) %>%
  ggplot(aes(win_rate, E_per_game)) +
  geom_point(alpha = 0.5)
Teams %>% summarise(cor(W/G, E/G))
## cor(W/G, E/G)
## 1 -0.2158873
1/1 point (graded)
Use the filtered Teams data frame from Question 6. Make a scatterplot of triples (X3B) per game versus doubles (X2B) per game. Which of the following is true?
There is no clear relationship between doubles per game and triples per game.
As the number of doubles per game increases, the number of triples per game tends to increase.
As the number of doubles per game increases, the number of triples per game tends to decrease.
Teams %>% filter(yearID %in% 1961:2001) %>%
  mutate(triple_per_game = X3B/G, double_per_game = X2B/G) %>%
  ggplot(aes(triple_per_game, double_per_game)) +
  geom_point(alpha = 0.5)
library(HistData)
Error in library(HistData) : there is no package called ‘HistData’
install.packages("HistData")
library(HistData)
data("GaltonFamilies")
galton_heights <- GaltonFamilies %>%
  filter(childNum == 1 & gender == 'male') %>%
  select(father, childHeight) %>%
  rename(son = childHeight)
galton_heights %>%
  summarise(mean(father), sd(father), mean(son), sd(son))
  mean(father) sd(father) mean(son)  sd(son)
1     69.09888   2.546555  70.45475 2.557061
galton_heights %>%
  ggplot(aes(father, son)) +
  geom_point(alpha=0.5)
Up to now in this series, we have focused mainly on univariate
variables. However, in data science applications it is very
common to be interested in the relationship between two or more
variables.
(Google this topic and explore a case study) We saw this in
our baseball example, in which we were interested in the relationship,
for example, between bases on balls and runs. We’ll come back to this
example, but we introduce the concepts of correlation and regression
using a simpler example.
We’ll create a data set with the heights of fathers and their first sons, taken from the actual data Galton used to discover and define regression. So we have the father and son height data. Suppose we were to summarize these data. Since both distributions are well approximated by normal distributions, we can use the two averages and two standard deviations as summaries. Here they are.
However, this summary fails to describe a very important
characteristic of the data that you can see in this figure. The trend
that the taller the father, the taller the son, is not described by the
summary statistics of the average and the standard deviation. We will
learn that the correlation coefficient is a summary of this
trend. (Interesting how the instructor jumped into the topic of his teaching here.)
[][Textbook link]
The corresponding textbook section is Case Study: is height hereditary? https://rafalab.github.io/dsbook/regression.html#case-study-is-height-hereditary
[][Key points]
Galton tried to predict sons' heights based on fathers' heights.
The means and standard deviations are insufficient for describing an important characteristic of the data: the trend that the taller the father, the taller the son.
The correlation coefficient is an informative summary of how two variables move together that can be used to predict one variable using the other.
Code
library(tidyverse)
library(HistData)
data("GaltonFamilies")
set.seed(1983)
galton_heights <- GaltonFamilies %>%
  filter(gender == "male") %>%
  group_by(family) %>%
  sample_n(1) %>%
  ungroup() %>%
  select(father, childHeight) %>%
  rename(son = childHeight)
galton_heights %>%
  summarize(mean(father), sd(father), mean(son), sd(son))
galton_heights %>%
  ggplot(aes(father, son)) +
  geom_point(alpha = 0.5)
library(dplyr)
library(Lahman)
library(HistData)
data("GaltonFamilies")
galton_heights <- GaltonFamilies %>%
filter(childNum == 1 & gender == 'male') %>%
select(father, childHeight) %>%
rename(son=childHeight)
galton_heights %>%
summarise(mean(father), sd(father), mean(son), sd(son))
## mean(father) sd(father) mean(son) sd(son)
## 1 69.09888 2.546555 70.45475 2.557061
galton_heights %>%
ggplot(aes(father, son)) +
geom_point(alpha=0.5)
The correlation coefficient is defined for a list of pairs (x1, y1) through (xn, yn) with the following formula. Here, mu x and mu y are the averages of x and y, respectively. And sigma x and sigma y are the standard deviations. The Greek letter rho is commonly used in statistics books to denote this correlation. The reason is that rho is the Greek letter for r, the first letter of the word regression.
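The formula itself did not survive in these notes. Reconstructed from the definitions of mu and sigma given here and from the key points further down, it presumably reads:

```latex
\rho = \frac{1}{n} \sum_{i=1}^{n}
       \left( \frac{x_i - \mu_x}{\sigma_x} \right)
       \left( \frac{y_i - \mu_y}{\sigma_y} \right)
```

Each factor is a standardized value (how many SDs an entry is from its average), and rho is the average of the products of the paired standardized values.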
Soon, we will learn about the connection between correlation and regression. To understand why this equation does, in fact, summarize how two variables move together, consider that the i-th entry of x is xi minus mu x, divided by sigma x, SDs away from the average. Similarly, the yi, which is paired with the xi, is yi minus mu y, divided by sigma y, SDs away from the average y.
If x and y are unrelated, then the product of these two quantities will be positive as often as it will be negative. It is positive when both quantities are positive or both are negative. It is negative when one is positive and the other is negative, or the other way around. So the products will average to about 0. The correlation is this average.
And therefore, unrelated variables will have a correlation of
about 0 (Why??? Sorry I didn't get it). If instead
the quantities vary together, then we are averaging mostly positive
products. Because they’re going to be either positive times positive or
negative times negative. And we get a positive correlation. If they vary
in opposite directions, we get a negative correlation.
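The "positive as often as negative" argument can be checked numerically. The sketch below uses simulated data in base R; the names `avg_product` and `share_same_sign` are helper functions written here for illustration, and the target correlation of 0.5 is an arbitrary choice. Note that `sd()` divides by n - 1, so the average below also divides by n - 1 to match `cor()` exactly.

```r
set.seed(2)
n <- 10000

x           <- rnorm(n)
y_unrelated <- rnorm(n)                                # generated independently of x
y_related   <- 0.5 * x + sqrt(1 - 0.5^2) * rnorm(n)    # built to correlate ~0.5 with x

# Average product of standardized values (hypothetical helper).
avg_product <- function(x, y) {
  zx <- (x - mean(x)) / sd(x)   # SDs away from the average of x
  zy <- (y - mean(y)) / sd(y)   # SDs away from the average of y
  sum(zx * zy) / (length(x) - 1)
}

# Share of pairs whose standardized values have the same sign,
# i.e. whose product is positive (hypothetical helper).
share_same_sign <- function(x, y) {
  mean((x - mean(x)) * (y - mean(y)) > 0)
}

share_same_sign(x, y_unrelated)  # about half: positives and negatives cancel
avg_product(x, y_unrelated)      # close to 0

share_same_sign(x, y_related)    # clearly more than half
avg_product(x, y_related)        # close to 0.5, identical to cor(x, y_related)
```

For the unrelated pair roughly half of the products are positive and half negative, so they cancel in the average. For the related pair most products are positive, so the average stays well above 0.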
Another thing to know is that we can show mathematically that the correlation is always between negative 1 and 1. To see this, consider that we cannot have a higher correlation than when we compare a list to itself. That would be perfect correlation. In this case, the correlation is given by this equation, which we can show is equal to 1. A similar argument with x and its exact opposite, negative x, proves that the correlation has to be greater than or equal to negative 1. So it’s between minus 1 and 1.
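The two extreme cases are easy to confirm in base R. The vectors below are hypothetical values picked only for illustration.

```r
# Hypothetical values, chosen only for illustration.
x <- c(2.1, 3.4, 1.9, 5.6, 4.2, 3.3)
y <- c(1.0, 2.5, 2.0, 4.0, 4.5, 2.2)

r_self     <- cor(x, x)    # a list compared with itself: perfect correlation
r_opposite <- cor(x, -x)   # a list compared with its exact opposite
r_other    <- cor(x, y)    # any other pairing falls in between

r_self
r_opposite
r_other
```

This is a numerical check rather than a proof; the mathematical argument is the one described in the paragraph above.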
To see what data look like for other values of rho, here are six examples of pairs with correlations ranging from negative 0.9 to 0.99. When the correlation is negative, we see that they go in opposite directions. As x increases, y decreases. When the correlation gets closer to either 1 or negative 1, we see the cloud of points getting thinner and thinner. When the correlation is 0, we just see a big circle of points.
[][Textbook link]
This video corresponds to the correlation coefficient section of the textbook. https://rafalab.github.io/dsbook/regression.html#the-correlation-coefficient
[][Key points]
The correlation coefficient is defined for a list of pairs
$(x_1, y_1), \dots, (x_n, y_n)$
as the average of the product of the standardized values:
$\rho = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{x_i - \mu_x}{\sigma_x} \right) \left( \frac{y_i - \mu_y}{\sigma_y} \right)$.
The correlation coefficient essentially conveys how two variables move together.
The correlation coefficient is always between -1 and 1.
Code
rho <- mean(scale(x)*scale(y))
galton_heights %>% summarize(r = cor(father, son)) %>% pull(r)
library(Lahman)
library(dplyr)
library(HistData)
data("GaltonFamilies")
galton_heights <- GaltonFamilies %>%
filter(childNum == 1 & gender == "male") %>%
select(father, childHeight) %>%
rename(son = childHeight)
galton_heights %>% summarise(cor(father, son))
## cor(father, son)
## 1 0.5007248
Before we continue describing regression, let’s go over a reminder about random variability. In most data science applications, we do not observe the population, but rather a sample.
As with the average and standard deviation, the sample correlation is the most commonly used estimate of the population correlation. This implies that the correlation we compute and use as a summary is a random variable.
As an illustration, let’s assume that the 179 pairs of fathers and sons are our entire population. A less fortunate geneticist can only afford to take a random sample of 25 pairs. The sample correlation for this random sample can be computed using this code. Here, the variable R is the random variable. We can run a Monte Carlo simulation to see the distribution of this random variable.
Here, we recreate R 1000 times, and plot its histogram. We see that the expected value is the population correlation, the mean of these Rs is 0.5, and that it has a relatively high standard error relative to its size, SD 0.147. This is something to keep in mind when interpreting correlations. It is a random variable, and it can have a pretty large standard error.
Also note that because the sample correlation is an average
of independent draws (independent? average? how?),
the Central Limit Theorem actually applies. Therefore, for a large
enough sample size N, the distribution of these Rs is approximately
normal.
The expected value we know is the population correlation. The
standard deviation is somewhat more complex to derive, but this is the
actual formula here. In our example, N equal to 25 does not appear to
be large enough to make the approximation a good one
(how to identify a good one ??? Should the standard deviation equal to a normal distribution or something ???),
as we see in this QQ-plot.
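The formula referred to above is missing from these notes. Judging from the slope used in the QQ-plot code in this section, `sqrt((1-mean(R)^2)/(N-2))`, it is presumably the following large-sample approximation, where r is the observed correlation and N the sample size:

```latex
R \sim \mathrm{N}\!\left( \rho,\ \sqrt{\frac{1 - r^2}{N - 2}} \right)
```

In the QQ-plot, the line drawn with `geom_abline` has intercept `mean(R)` and this standard deviation as its slope. If the normal approximation were good, the points would fall close to that line.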
[][Textbook link]
This video corresponds to the textbook section titled: Sample correlation is a random variable. https://rafalab.github.io/dsbook/regression.html#sample-correlation-is-a-random-variable
[][Key points]
The correlation that we compute and use as a summary is a random variable.
When interpreting correlations, it is important to remember that correlations derived from samples are estimates containing uncertainty.
Because the sample correlation is an average of independent draws, the central limit theorem applies.
Code
# compute sample correlation
R <- sample_n(galton_heights, 25, replace = TRUE) %>%
  summarize(r = cor(father, son))
R

# Monte Carlo simulation to show distribution of sample correlation
B <- 1000
N <- 25
R <- replicate(B, {
  sample_n(galton_heights, N, replace = TRUE) %>%
    summarize(r = cor(father, son)) %>%
    pull(r)
})
qplot(R, geom = "histogram", binwidth = 0.05, color = I("black"))

# expected value and standard error
mean(R)
sd(R)

# QQ-plot to evaluate whether N is large enough
data.frame(R) %>%
  ggplot(aes(sample = R)) +
  stat_qq() +
  geom_abline(intercept = mean(R), slope = sqrt((1-mean(R)^2)/(N-2)))
library(Lahman)
library(dplyr)
library(HistData)
library(stats)
data("GaltonFamilies")
galton_heights <- GaltonFamilies %>%
filter(childNum == 1 & gender == "male") %>%
select(father, childHeight) %>%
rename(son = childHeight)
set.seed(0)
R <- sample_n(galton_heights, 25, replace=TRUE) %>%
summarize(cor(father, son))
R
## cor(father, son)
## 1 0.5889351
library(dplyr)
library(ggplot2)
B <- 1000
N <- 25
R <- replicate(B, {
sample_n(galton_heights, N, replace = TRUE) %>%
summarize(r = cor(father, son)) %>% .$r
})
# `.$r` extracts the column r from the piped data frame: the dot is the pipe's placeholder
# for the object coming from the left, and $ accesses a data frame column (same result as pull(r)).
data.frame(R) %>%
ggplot(aes(R)) + geom_histogram(binwidth=0.05, color='black')
mean(R)
## [1] 0.5040874
sd(R)
## [1] 0.1439084
library(Lahman)
library(dplyr)
library(HistData)
library(stats)
data("GaltonFamilies")
galton_heights <- GaltonFamilies %>%
filter(childNum == 1 & gender == "male") %>%
select(father, childHeight) %>%
rename(son = childHeight)
set.seed(0)
R <- sample_n(galton_heights, 25, replace=TRUE) %>%
summarize(mean(father), sd(father), mean(son), sd(son))
R
## mean(father) sd(father) mean(son) sd(son)
## 1 69.232 2.302303 70.632 2.107273
1/1 point (graded)
While studying heredity, Francis Galton developed what important statistical concept?
- Standard deviation
- Normal distribution
- Correlation
- Probability

1/1 point (graded)
The correlation coefficient is a summary of what?
- The trend between two variables
- The dispersion of a variable
- The central tendency of a variable
- The distribution of a variable

1/1 point (graded)
Below is a scatter plot showing the relationship between two variables, x and y.
Scatter plot of relationship between x (plotted on the x-axis) and y (plotted on the y-axis). y-axis values range from -3 to 3; x-axis values range from -3 to 3. Points are fairly well distributed in a tight band with a range from approximately (-2, 2) to (3, -3).
From this figure, the correlation between x and y appears to be about:
- -0.9
- -0.2
- 0.9
- 2

1/1 point (graded)
Instead of running a Monte Carlo simulation with a sample size of 25 from the 179 father-son pairs described in the videos, we now run our simulation with a sample size of 50. Would you expect the mean of our sample correlation to increase, decrease, or stay approximately the same?
- Increase
- Decrease
- Stay approximately the same

1/1 point (graded)
Instead of running a Monte Carlo simulation with a sample size of 25 from the 179 father-son pairs described in the videos, we now run our simulation with a sample size of 50. Would you expect the standard deviation of our sample correlation to increase, decrease, or stay approximately the same?
- Increase
- Decrease
- Stay approximately the same

1/1 point (graded)
If X and Y are completely independent, what do you expect the value of the correlation coefficient to be?
- -1
- -0.5
- 0
- 0.5
- 1
- Not enough information to answer the question
1/1 point (graded)
Load the Lahman library. Filter the Teams data frame to include years from 1961 to 2001. What is the correlation coefficient between number of runs per game and number of at bats per game? correct 0.6580976
1/1 point (graded)
Use the filtered Teams data frame from Question 7. What is the correlation coefficient between win rate (number of wins per game) and number of errors per game? correct -0.3396947
1/1 point (graded)
Use the filtered Teams data frame from Question 7. What is the correlation coefficient between doubles (X2B) per game and triples (X3B) per game? correct -0.01157404
library(Lahman)
library(dplyr)
library(ggplot2)
library(tidyverse)
Teams %>% filter(yearID %in% 1961:2001 ) %>%
mutate(number_of_runs_per_game = R/G, number_of_bats_per_game = AB/G) %>%
ggplot2::ggplot(aes(number_of_runs_per_game, number_of_bats_per_game)) + geom_point(alpha = 0.5)
# https://stackoverflow.com/questions/60901319/r-language-registered-s3-method-overwritten-by-data-table
library(Lahman)
library(dplyr)
library(ggplot2)
library(tidyverse)
Teams %>% filter(yearID %in% 1961:2001 ) %>%
summarize(cor(R/G, AB/G))
## cor(R/G, AB/G)
## 1 0.6580976
library(Lahman)
library(dplyr)
library(ggplot2)
library(tidyverse)
Teams %>% filter(yearID %in% 1961:2001 ) %>%
summarize(cor(W/G, E/G))
## cor(W/G, E/G)
## 1 -0.3396947
library(Lahman)
library(dplyr)
library(ggplot2)
library(tidyverse)
Teams %>% filter(yearID %in% 1961:2001 ) %>%
summarize(cor(X2B/G, X3B/G))
## cor(X2B/G, X3B/G)
## 1 -0.01157404
Correlation is not always a good summary of the relationship between two variables. A famous example used to illustrate this is the following four artificial data sets, referred to as Anscombe’s quartet. All of these pairs have a correlation of 0.82. Correlation is only meaningful in a particular context.
To help us understand when correlation is meaningful as a summary statistic, we’ll try to predict the son’s height using the father’s height. This will help motivate and define linear regression. We start by demonstrating how correlation can be useful for prediction. Suppose we are asked to guess the height of a randomly selected son. Because the distribution of son heights is approximately normal, we know that the average height of 70.5 inches is the value with the highest proportion and would be the prediction with the best chance of minimizing the error. But what if we are told that the father is 72 inches? Do we still guess 70.5 inches for the son? The father is taller than average: specifically, he is 1.14 standard deviations taller than the average father. So should we predict that the son is also 1.14 standard deviations taller than the average son? It turns out that this would be an overestimate.
To see this, we look at all the sons with fathers who are about 72 inches. We do this by stratifying the father’s height. We call this a conditional average, since we are computing the average son height conditioned on the father being 72 inches tall. A challenge when using this approach in practice is that we don’t have many fathers who are exactly 72 inches. In our data set, we only have eight. If we change the number to 72.5, we would only have one father who is that height. This would result in averages with large standard errors, which would not be useful for prediction. So for now, we take the approach of creating strata of fathers with very similar heights. Specifically, we round fathers’ heights to the nearest inch. This gives us the following prediction for the son of a father who is approximately 72 inches tall. We can use this code and get our answer, which is 71.84. This is 0.54 standard deviations larger than the average son, a smaller number than the 1.14 standard deviations by which the father exceeded the average father. Stratification followed by box plots lets us see the distribution of each group. Here is that plot.
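The stratification idea can be tried on simulated data. The sketch below uses base R and made-up height parameters (it does not use `galton_heights`); the data frame `heights` and the generating values are assumptions for illustration only.

```r
# Simulated heights in inches; these are NOT the Galton data
set.seed(2)
n <- 1000
father <- rnorm(n, mean = 69, sd = 2.5)
son <- 70.5 + 0.5 * (father - 69) + rnorm(n, mean = 0, sd = 2.2)
heights <- data.frame(father, son)

sum(heights$father == 72)                  # exact matches: typically none for continuous data
in_stratum <- round(heights$father) == 72  # stratum of fathers who are about 72 inches
sum(in_stratum)                            # rounding gives a usable number of pairs
conditional_avg <- mean(heights$son[in_stratum])
conditional_avg

# Express the father's height and the conditional average in standard units
z_father <- (72 - mean(heights$father)) / sd(heights$father)
z_son <- (conditional_avg - mean(heights$son)) / sd(heights$son)
c(z_father = z_father, z_son = z_son)      # z_son is expected to be the smaller of the two
```

The comparison of `z_father` and `z_son` mirrors the 1.14 versus 0.54 standard deviations in the text: the sons of tall fathers are taller than average, but by fewer standard units.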
We can see that the centers of these groups are increasing with height, not surprisingly. The means of each group appear to follow a linear relationship. (Note to self: then why does the DataCamp course “Exploratory Data Analysis in Python” use a scatter plot to explore the relationship between two variables? With many dots covering each other, would a boxplot per stratum avoid the overplotting?) We can make that plot like this, with this code. See the plot and notice that this appears to follow a line. The slope of this line appears to be about 0.5, which happens to be the correlation between father and son heights. This is not a coincidence. To see this connection, let’s plot the standardized heights (why standardize? could we just use the boxplot or the grouped means?) against each other, son versus father, with a line that has a slope equal to the correlation. (Read the comments next to the standardized plot code below, and think about how it is all related.)
Here’s the code. Here’s a plot. This line is what we call the regression line. In a later video, we will describe Galton’s theoretical justification for using this line to estimate conditional means. Here, we define it and compute it for the data at hand. The regression line for two variables, x and y, tells us that for every standard deviation \(\sigma_x\) that x increases above the average \(\mu_x\), y grows \(\rho\) standard deviations \(\sigma_y\) above the average \(\mu_y\). The formula for the regression line is therefore \(\frac{y - \mu_y}{\sigma_y} = \rho \, \frac{x - \mu_x}{\sigma_x}\). (How does this come out? It is the previous sentence written in symbols, with both sides in standard units.) If there’s perfect correlation, we predict an increase that is the same number of SDs. If there’s zero correlation, then we don’t use x at all for the prediction of y. For values between 0 and 1, the prediction is somewhere in between. If the correlation is negative, we predict a reduction instead of an increase.
It is because we predict something closer to the mean, in standard units, when the correlation is positive but lower than 1 that we call this regression: the son regresses to the average height. (Note: the justification for using this line, given later, assumes the data are approximately bivariate normal.)
In fact, the title of Galton’s paper was “Regression Towards Mediocrity in Hereditary Stature.” Note that if we write this in the standard form of a line, y equals b plus mx, where b is the intercept and m is the slope, the regression line has slope rho times sigma y divided by sigma x, and intercept mu y minus mu x times the slope. So if we standardize the variables so they have average 0 and standard deviation 1, then the regression line has intercept 0 and slope equal to the correlation rho. Let’s look at the original father-son data and add the regression line. We can compute the intercept and the slope using the formulas we just derived. Here’s the code to make the plot with the regression line. If we plot the data in standard units, then, as we discussed, the regression line has intercept 0 and slope rho. Here’s the code to make that plot.
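The step from the standard-units form of the regression line to its slope and intercept can be written out explicitly. This is a short worked derivation using the symbols from the text; the names \(m\) and \(b\) match the variables used in the code of these notes.

$$
\frac{y - \mu_y}{\sigma_y} = \rho\,\frac{x - \mu_x}{\sigma_x}
\quad\Longrightarrow\quad
y = \mu_y + \rho\,\frac{\sigma_y}{\sigma_x}\,(x - \mu_x)
$$

$$
y = b + m\,x, \qquad m = \rho\,\frac{\sigma_y}{\sigma_x}, \qquad b = \mu_y - m\,\mu_x
$$

With standardized variables, \(\mu_x = \mu_y = 0\) and \(\sigma_x = \sigma_y = 1\), so \(m = \rho\) and \(b = 0\).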
We started this discussion by saying that we wanted to use the conditional means to predict the heights of the sons. But then we realized that there were very few data points in each stratum. When we did this approximation of rounding off the height of the fathers (the boxplot), we found that these conditional means appear to follow a line. And we ended up with the regression line (which has slope rho and intercept 0 once both variables, father and son, are standardized). So the regression line gives us the prediction. An advantage of using the regression line is that we use all the data to estimate just two parameters, the slope and the intercept. This makes it much more stable. When we compute conditional means, we have fewer data points in each stratum, which gives the estimates a large standard error and makes them unstable. So using the regression line gives us a much more stable prediction. However, are we justified in using the regression line to predict? Galton gives us the answer.
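The stability claim can be checked with a small Monte Carlo comparison. The sketch below is base R on a simulated stand-in population (not the Galton data); the population parameters, the sample size `N <- 50`, and the target father height of 72 inches are illustrative assumptions.

```r
# Hypothetical population of 179 father/son pairs (simulated, NOT the Galton data)
set.seed(3)
n_pop <- 179
father <- rnorm(n_pop, mean = 69, sd = 2.5)
son <- 70.5 + 0.5 * (father - 69) + rnorm(n_pop, mean = 0, sd = 2.2)
pop <- data.frame(father, son)

B <- 1000   # Monte Carlo replicates
N <- 50     # sample size per replicate
conditional_avg <- numeric(B)
regression_pred <- numeric(B)
for (i in seq_len(B)) {
  s <- pop[sample(nrow(pop), N, replace = TRUE), ]
  # Conditional mean: average son height among fathers who round to 72 (NA if the stratum is empty)
  in_stratum <- round(s$father) == 72
  conditional_avg[i] <- if (any(in_stratum)) mean(s$son[in_stratum]) else NA_real_
  # Regression line estimated from all N pairs, then evaluated at 72 inches
  m <- cor(s$father, s$son) * sd(s$son) / sd(s$father)
  b <- mean(s$son) - m * mean(s$father)
  regression_pred[i] <- b + m * 72
}

# Compare the centres and the spreads of the two kinds of prediction
c(conditional = mean(conditional_avg, na.rm = TRUE), regression = mean(regression_pred))
c(conditional = sd(conditional_avg, na.rm = TRUE), regression = sd(regression_pred))
```

Both approaches aim at the same quantity, but the regression prediction draws on all `N` pairs rather than the handful in one stratum, so its spread across replicates should be noticeably smaller.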
Textbook link
There are three links to relevant sections of the textbook for this video:
correlation is not always a useful summary
https://rafalab.github.io/dsbook/regression.html#correlation-is-not-always-a-useful-summary
conditional expectation
https://rafalab.github.io/dsbook/regression.html#conditional-expectation
the regression line
https://rafalab.github.io/dsbook/regression.html#the-regression-line
Key points
- Correlation is not always a good summary of the relationship between two variables.
- The general idea of conditional expectation is that we stratify a population into groups and compute summaries in each group.
- A practical way to improve the estimates of the conditional expectations is to define strata of observations with similar values of x.
- If there is perfect correlation, the regression line predicts an increase that is the same number of SDs for both variables. If there is 0 correlation, then we don’t use x at all for the prediction and simply predict the average \(\mu_y\). For values between 0 and 1, the prediction is somewhere in between. If the correlation is negative, we predict a reduction instead of an increase.
Code
# number of fathers with height 72 or 72.5 inches
sum(galton_heights$father == 72)
sum(galton_heights$father == 72.5)

# predicted height of a son with a 72 inch tall father
conditional_avg <- galton_heights %>%
  filter(round(father) == 72) %>%
  summarize(avg = mean(son)) %>%
  pull(avg)
conditional_avg

# stratify fathers' heights to make a boxplot of son heights
galton_heights %>%
  mutate(father_strata = factor(round(father))) %>%
  ggplot(aes(father_strata, son)) +
  geom_boxplot() +
  geom_point()

# center of each boxplot
galton_heights %>%
  mutate(father = round(father)) %>%
  group_by(father) %>%
  summarize(son_conditional_avg = mean(son)) %>%
  ggplot(aes(father, son_conditional_avg)) +
  geom_point()

# calculate values to plot regression line on original data
mu_x <- mean(galton_heights$father)
mu_y <- mean(galton_heights$son)
s_x <- sd(galton_heights$father)
s_y <- sd(galton_heights$son)
r <- cor(galton_heights$father, galton_heights$son)
m <- r * s_y/s_x
b <- mu_y - m*mu_x

# add regression line to plot
galton_heights %>%
  ggplot(aes(father, son)) +
  geom_point(alpha = 0.5) +
  geom_abline(intercept = b, slope = m)
conditional_avg <- galton_heights %>%
filter(round(father)==72) %>%
summarise(avg=mean(son)) %>% .$avg
conditional_avg
## [1] 71.83571
galton_heights %>%
mutate(father_strata = factor(round(father))) %>%
ggplot2::ggplot(aes(father_strata, son)) +
geom_boxplot() +
geom_point()
galton_heights %>%
mutate(father=round(father)) %>%
group_by(father) %>%
summarise(son_conditional_avg = mean(son)) %>%
ggplot2::ggplot(aes(father, son_conditional_avg)) +
geom_point()
r <- galton_heights %>% summarise(r = cor(father, son)) %>% .$r
galton_heights %>%
mutate(father=round(father)) %>%
group_by(father) %>%
summarise(son = mean(son)) %>%
mutate(z_father = scale(father), z_son = scale(son)) %>%
ggplot2::ggplot(aes(z_father, z_son)) +
geom_point() +
geom_abline(intercept = 0, slope = r)
# **Why do we use scale function in R?**
# When we want to scale the values in several columns of a data frame so that each column has a mean of 0 and a standard deviation of 1, we usually use the scale() function.
r <- galton_heights %>% summarise(r = cor(father, son)) %>% .$r
galton_heights %>%
mutate(father=round(father)) %>%
group_by(father) %>%
summarise(son = mean(son)) %>%
mutate(z_father = father, z_son = son) %>%
ggplot2::ggplot(aes(z_father, z_son)) +
geom_point() +
geom_abline(intercept = 35.5, slope = r)
# So one reason to standardize is that the line becomes easy to write down: intercept 0 and slope r. Without standardizing, the intercept has to be computed as mu_y - m * mu_x with slope m = r * s_y / s_x (here it was guessed as roughly 35.5), and the two plots look different. Recall how correlation is calculated, and that scale() standardizes each value as (x - mean(x)) / sd(x).
a <- galton_heights %>%
mutate(father=round(father)) %>%
group_by(father) %>%
summarise(son = mean(son)) %>%
mutate(z_father = scale(father), z_son = scale(son))
# https://stackoverflow.com/questions/20256028/understanding-scale-in-r
a
## # A tibble: 15 x 4
## father son z_father[,1] z_son[,1]
## <dbl> <dbl> <dbl> <dbl>
## 1 62 65.2 -1.70 -2.14
## 2 64 68.1 -1.28 -1.00
## 3 65 67.6 -1.06 -1.21
## 4 66 69.2 -0.850 -0.566
## 5 67 70.0 -0.638 -0.262
## 6 68 69.2 -0.425 -0.572
## 7 69 71.2 -0.213 0.231
## 8 70 71.2 0 0.198
## 9 71 71.5 0.213 0.341
## 10 72 71.8 0.425 0.469
## 11 73 71.5 0.638 0.350
## 12 74 75.2 0.850 1.82
## 13 75 71.2 1.06 0.204
## 14 76 73.5 1.28 1.13
## 15 78 73.2 1.70 1.01
Read the below statement
image here
mu_x <- mean(galton_heights$father)
mu_y <- mean(galton_heights$son)
s_x <- sd(galton_heights$father)
s_y <- sd(galton_heights$son)
r <- cor(galton_heights$father, galton_heights$son)
m <- r * s_y / s_x
b <- mu_y - m * mu_x
galton_heights %>%
ggplot2::ggplot(aes(father, son)) +
geom_point(alpha=0.3) +
geom_abline(intercept = b, slope=m)
galton_heights %>%
ggplot2::ggplot(aes(scale(father), scale(son))) +
geom_point(alpha=0.3) +
geom_abline(intercept = 0, slope = r)
Correlation and the regression line are widely used summary statistics, but they are often misused or misinterpreted. Anscombe’s example provided toy data sets in which summarizing with a correlation would be a mistake. But we also see such misuse in the media and in the scientific literature.
The main way we motivate the use of correlation involves what is called the bivariate normal distribution. When a pair of random variables is approximated by a bivariate normal distribution, the scatterplot looks like an oval, like an American football. The ovals can be thin, which is when the variables have high correlation, all the way up to a circle shape when they have no correlation. We saw some examples previously. Here they are again. A more technical way to define the bivariate normal distribution is the following. First, this distribution is defined for pairs (remember the father-son pairs we used earlier). So we have two variables, x and y, and they have paired values. They are going to be bivariate normally distributed if the following happens.
If x is a normally distributed random variable, y is also a normally distributed random variable, and for any grouping of x that we can define, say, with x being equal to some predetermined value, which we call little x in this formula, the y’s in that group are approximately normal as well, then the pair is approximately bivariate normal. When we fix x in this way, we refer to the resulting distribution of the y’s in the group as the conditional distribution of y given x. (Remember that we did this before when we fixed the father’s height at 72 inches, although that stratum was too small to be reliable.) We write \(f_{Y \mid X = x}\) for the conditional distribution and \(\mbox{E}(Y \mid X = x)\) for the conditional expectation. If we think the height data is well-approximated by the bivariate normal distribution, then we should see the normal approximation hold for each grouping.
Here, we stratify the son height by the standardized father heights and see that the assumption appears to hold. Here’s the code that gives us the desired plot. Now, we come back to defining correlation. Galton showed, using mathematical statistics, that when two variables follow a bivariate normal distribution, then for any given x the expected value of the y in pairs for which x is set at that value is \(\mbox{E}(Y \mid X = x) = \mu_y + \rho \, \frac{x - \mu_x}{\sigma_x} \, \sigma_y\). Note that this is a line with slope \(m = \rho \, \sigma_y / \sigma_x\) and intercept \(\mu_y - m \, \mu_x\). Therefore, this is the same as the regression line we saw in a previous video. (Now the difference in slope between the standardized and non-standardized variables in the father-son example from the previous chapter should make sense.) That can be written like this: \(\frac{\mbox{E}(Y \mid X = x) - \mu_y}{\sigma_y} = \rho \, \frac{x - \mu_x}{\sigma_x}\). So in summary, if our data is approximately bivariate normal, then the conditional expectation, which is the best prediction for y given that we know the value of x, is given by the regression line.
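The claim that conditional expectations fall on the regression line under bivariate normality can be checked by simulation. The sketch below builds a bivariate normal pair from two independent standard normals in base R; the parameter values (`mu_x`, `mu_y`, `sigma_x`, `sigma_y`, `rho`) are arbitrary choices for illustration, not estimates from the Galton data.

```r
set.seed(4)
n <- 100000
mu_x <- 69; mu_y <- 70.5; sigma_x <- 2.5; sigma_y <- 2.5; rho <- 0.5

# Construct (x, y) with correlation rho from two independent standard normals
z1 <- rnorm(n)
z2 <- rnorm(n)
x <- mu_x + sigma_x * z1
y <- mu_y + sigma_y * (rho * z1 + sqrt(1 - rho^2) * z2)

# Conditional averages of y within strata of x rounded to the nearest unit
strata <- round(x)
keep <- strata %in% 66:72
cond_avg <- tapply(y[keep], strata[keep], mean)

# Values of the regression line at the stratum centres
x0 <- as.numeric(names(cond_avg))
line <- mu_y + rho * (x0 - mu_x) / sigma_x * sigma_y
round(cbind(x0, cond_avg, line), 2)
```

The `cond_avg` and `line` columns should agree closely; small differences come from simulation noise and from the width of each stratum.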
Textbook link
This video corresponds to the textbook section on the bivariate normal distribution (advanced): https://rafalab.github.io/dsbook/regression.html#bivariate-normal-distribution-advanced
Key points
- When a pair of random variables is approximated by the bivariate normal distribution, scatterplots look like ovals. They can be thin (high correlation) or circle-shaped (no correlation).
- When two variables follow a bivariate normal distribution, computing the regression line is equivalent to computing conditional expectations.
- We can obtain a much more stable estimate of the conditional expectation by finding the regression line and using it to make predictions.
Code
galton_heights %>%
  mutate(z_father = round((father - mean(father)) / sd(father))) %>%
  filter(z_father %in% -2:2) %>%
  ggplot() +
  stat_qq(aes(sample = son)) +
  facet_wrap( ~ z_father)
galton_heights %>%
#mutate(z_father=round((father-mean(father))/sd(father))) %>%
mutate(z_father=round(scale(father))) %>% # Do you recall this
filter(z_father %in% -2:2) %>%
ggplot2::ggplot() +
stat_qq(aes(sample=son)) +
facet_wrap(~z_father)
# What does a QQ plot show?
# The purpose of the quantile-quantile (QQ) plot is to show if two data sets come from the same distribution. Plotting the first data set's quantiles along the x-axis and plotting the second data set's quantiles along the y-axis is how the plot is constructed.
# https://math.illinois.edu/system/files/inline-files/Proj9AY1516-report2.pdf
# Where does above formula comes from???
Correction (the video was recorded a long time ago): the equation shown at 0:10 is for the standard deviation of the conditional distribution, not the variance. The variance is the standard deviation squared. See the notes below the video for more clarification.
The theory we’ve been describing also tells us that the standard deviation of the conditional distribution that we described in a previous video is \(\mbox{SD}(Y \mid X = x) = \sigma_y \sqrt{1 - \rho^2}\), that is, sigma y times the square root of 1 minus rho squared. This is where statements like “x explains such and such percent of the variation in y” come from. Note that the variance of y is \(\sigma_y^2\). That’s where we start. If we condition on x, then the variance goes down to \((1 - \rho^2) \, \sigma_y^2\), which is the conditional standard deviation squared. So from there, we can compute how much the variance has gone down: the reduction is \(\rho^2 \sigma_y^2\), which is \(\rho^2 \times 100\)% of the original variance. So the correlation and the amount of variance explained are related to each other. But it is important to remember that the “variance explained” statement only makes sense when the data is approximated by a bivariate normal distribution.
(Note to self: read all the course material and think it through, then try to answer your own questions. It is a good way of learning.)
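The variance-explained statement can be verified numerically. The sketch below simulates a bivariate normal pair in base R with assumed values `rho <- 0.5`, `sigma_x <- 2`, and `sigma_y <- 3` (chosen for illustration) and compares the variance of y with the variance left over around the regression line.

```r
set.seed(5)
n <- 100000
rho <- 0.5; sigma_x <- 2; sigma_y <- 3

# Bivariate normal pair with mean 0, built from two independent standard normals
z1 <- rnorm(n)
z2 <- rnorm(n)
x <- sigma_x * z1
y <- sigma_y * (rho * z1 + sqrt(1 - rho^2) * z2)

# Residuals around the regression line E(Y | X = x) = rho * sigma_y / sigma_x * x
res <- y - rho * sigma_y / sigma_x * x

var(y)                  # close to sigma_y^2 = 9
var(res)                # close to (1 - rho^2) * sigma_y^2 = 6.75
1 - var(res) / var(y)   # proportion of variance explained, close to rho^2 = 0.25
```

With a correlation of 0.5, conditioning on x removes about a quarter of the variance of y, not half.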
Textbook link
This video corresponds to the textbook section on variance explained: https://rafalab.github.io/dsbook/regression.html#variance-explained
Key points
- Conditioning on a random variable X can help to reduce the variance of the response variable Y.
- The standard deviation of the conditional distribution is \(\mbox{SD}(Y \mid X = x) = \sigma_y \sqrt{1 - \rho^2}\), which is smaller than the standard deviation without conditioning, \(\sigma_y\).
- Because variance is the standard deviation squared, the variance of the conditional distribution is \(\mbox{Var}(Y \mid X = x) = \sigma_y^2 (1 - \rho^2)\).
- In the statement "X explains such and such percent of the variability," the percent value refers to the variance. The variance decreases by \(\rho^2 \times 100\) percent.
- The “variance explained” statement only makes sense when the data is approximated by a bivariate normal distribution.

(Question to self: where does the conditional variance equation come from?)
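One way to see where the conditional variance comes from is the standard decomposition of a bivariate normal pair into the regression line plus independent noise. The symbol \(\varepsilon\) below is introduced only for this derivation; it denotes the part of \(Y\) not accounted for by the line, which under bivariate normality has mean 0 and is independent of \(X\).

$$
Y = \mu_y + \rho\,\frac{\sigma_y}{\sigma_x}\,(X - \mu_x) + \varepsilon
$$

$$
\sigma_y^2 = \mbox{Var}(Y) = \rho^2\,\frac{\sigma_y^2}{\sigma_x^2}\,\sigma_x^2 + \mbox{Var}(\varepsilon) = \rho^2\sigma_y^2 + \mbox{Var}(\varepsilon)
\quad\Longrightarrow\quad
\mbox{Var}(\varepsilon) = (1 - \rho^2)\,\sigma_y^2
$$

$$
\mbox{Var}(Y \mid X = x) = \mbox{Var}(\varepsilon) = (1 - \rho^2)\,\sigma_y^2,
\qquad
\mbox{SD}(Y \mid X = x) = \sigma_y\sqrt{1 - \rho^2}
$$

Once \(X\) is fixed at \(x\), the only remaining randomness in \(Y\) is \(\varepsilon\), so the fraction of the original variance that has been removed is \(\rho^2\).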
We computed a regression line to predict the son’s height from the father’s height. We used these calculations– here’s the code–to get the slope and the intercept. This gives us the function that the conditional expectation of y given x is 35.7 plus 0.5 times x.
So, what if we wanted to predict the father’s height based on the son’s? It is important to know that this is not determined by computing the inverse function of what we just saw, which would be \(x = (\mbox{E}(Y \mid X = x) - 35.7) / 0.5\). We need to compute the expected value of x given y. This gives us another regression function altogether, with slope \(\rho \, \sigma_x / \sigma_y\) and intercept \(\mu_x - \rho \, \frac{\sigma_x}{\sigma_y} \, \mu_y\). (Where does this come from? It is the same formula as before with the roles of x and y swapped.) So now we get that the expected value of x given y, or the expected value of the father’s height given the son’s height, is equal to 34 plus 0.5 y, a different regression line.
So in summary, it’s important to remember that the regression line comes from computing expectations, and these give you two different lines, depending on if you compute the expectation of y given x or x given y.
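The difference between the second regression line and the algebraic inverse of the first can be seen in a few lines of base R. The data below are simulated with made-up parameters (not the Galton data); `m_inv` and `b_inv` are hypothetical names for the slope and intercept obtained by solving the first line for the father's height.

```r
set.seed(6)
n <- 500
father <- rnorm(n, mean = 69, sd = 2.5)
son <- 70.5 + 0.5 * (father - 69) + rnorm(n, mean = 0, sd = 2.2)

r <- cor(father, son)

# Regression line for E(son | father)
m_1 <- r * sd(son) / sd(father)
b_1 <- mean(son) - m_1 * mean(father)

# Regression line for E(father | son)
m_2 <- r * sd(father) / sd(son)
b_2 <- mean(father) - m_2 * mean(son)

# Algebraic inverse of the first line, solved for father
m_inv <- 1 / m_1
b_inv <- -b_1 / m_1

c(m_2 = m_2, m_inv = m_inv)   # the slopes differ unless the correlation is +1 or -1
m_1 * m_2                     # the product of the two regression slopes equals r^2
```

The slope `m_2` matches what `lm(father ~ son)` would report, while `m_inv` is steeper by a factor of \(1/r^2\).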
Textbook link
The link to the corresponding section of the textbook is “warning: there are two regression lines”: https://rafalab.github.io/dsbook/regression.html#warning-there-are-two-regression-lines
Key point
There are two different regression lines depending on whether we are taking the expectation of Y given X or taking the expectation of X given Y.
Code
# compute a regression line to predict the son's height from the father's height
mu_x <- mean(galton_heights$father)
mu_y <- mean(galton_heights$son)
s_x <- sd(galton_heights$father)
s_y <- sd(galton_heights$son)
r <- cor(galton_heights$father, galton_heights$son)
m_1 <- r * s_y / s_x
b_1 <- mu_y - m_1*mu_x

# compute a regression line to predict the father's height from the son's height
m_2 <- r * s_x / s_y
b_2 <- mu_x - m_2*mu_y
mu_x <- mean(galton_heights$father)
mu_y <- mean(galton_heights$son)
s_x <- sd(galton_heights$father)
s_y <- sd(galton_heights$son)
r <- cor(galton_heights$father, galton_heights$son)
m <- r * s_y/s_x # slope: correlation times the ratio of standard deviations (not variances)
b <- mu_y - m*mu_x
m
## [1] 0.5027904
b
## [1] 35.71249
Using the son’s height to predict the father’s by inverting the line above is wrong.
Instead, compute the expected value of x given y:
m <- r * s_x/s_y
b <- mu_x - m*mu_y
m
## [1] 0.4986676
b
## [1] 33.96539
two different lines depending on what you do
Look at the figure below.
Scatter plot of son and father heights with son heights on the y-axis and father heights on the x-axis. There is also a regression line that runs from roughly (63,66) to (78,76). The dots on the plot are scattered around the line.
The slope of the regression line in this figure is equal to what, in words?
- Slope = (correlation coefficient of son and father heights) * (standard deviation of sons’ heights / standard deviation of fathers’ heights)
- Slope = (correlation coefficient of son and father heights) * (standard deviation of fathers’ heights / standard deviation of sons’ heights)
- Slope = (correlation coefficient of son and father heights) / (standard deviation of sons’ heights * standard deviation of fathers’ heights)
- Slope = (mean height of fathers) - (correlation coefficient of son and father heights * mean height of sons)
1 point possible (graded)
Why does the regression line simplify to a line with intercept zero and slope rho when we standardize our x and y variables? Try the simplification on your own first!
- When we standardize variables, both x and y will have a mean of one and a standard deviation of zero. When you substitute this into the formula for the regression line, the terms cancel out until we have the following equation: y_i = rho * x_i.
- When we standardize variables, both x and y will have a mean of zero and a standard deviation of one. When you substitute this into the formula for the regression line, the terms cancel out until we have the following equation: y_i = rho * x_i.
- When we standardize variables, both x and y will have a mean of zero and a standard deviation of one. When you substitute this into the formula for the regression line, the terms cancel out until we have the following equation: y_i = rho + x_i.
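The simplification can also be checked numerically. The sketch below standardizes two simulated variables in base R (made-up parameters, not the Galton data) and applies the slope and intercept formulas from these notes to the standardized values.

```r
set.seed(7)
n <- 200
x <- rnorm(n, mean = 69, sd = 2.5)
y <- 70.5 + 0.5 * (x - 69) + rnorm(n, mean = 0, sd = 2.2)

# Standardize: subtract the mean and divide by the standard deviation
z_x <- (x - mean(x)) / sd(x)
z_y <- (y - mean(y)) / sd(y)

r <- cor(x, y)                     # correlation is unchanged by standardizing
m <- r * sd(z_y) / sd(z_x)         # both SDs are 1, so the slope equals r
b <- mean(z_y) - m * mean(z_x)     # both means are 0, so the intercept is 0 (up to rounding error)
c(slope = m, intercept = b, r = r)
```

The same slope and intercept come out of `lm(z_y ~ z_x)`, which is a convenient cross-check.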
1 point possible (graded)
What is a limitation of calculating conditional means? Select ALL that apply.
- Each stratum we condition on (e.g., a specific father’s height) may not have many data points.
- Because there are limited data points for each stratum, our average values have large standard errors.
- Conditional means are less stable than a regression line.
- Conditional means are a useful theoretical tool but cannot be calculated.
1/1 point (graded)
A regression line is the best prediction of Y given we know the value of X when:
- X and Y follow a bivariate normal distribution.
- Both X and Y are normally distributed.
- Both X and Y have been standardized.
- There are at least 25 X-Y pairs.
0/1 point (graded)
Which one of the following scatterplots depicts an x and y distribution that is NOT well-approximated by the bivariate normal distribution?
I chose plot 3, which was marked incorrect. Explanation from the course:
The v-shaped distribution of points from the first plot means that the x and y variables do not follow a bivariate normal distribution.
When a pair of random variables is approximated by a bivariate normal, the scatter plot looks like an oval (as in the 2nd, 3rd, and 4th plots). It is okay if the oval is very round (as in the 3rd plot) or long and thin (as in the 4th plot).
0/1 point (graded)
We previously calculated that the correlation coefficient between fathers’ and sons’ heights is 0.5. Given this, what percent of the variation in sons’ heights is explained by fathers’ heights?
- 0%
- 25%
- 50%
- 75%

I chose 50%, which is incorrect. Feedback: when two variables follow a bivariate normal distribution, the variation explained can be calculated as \(\rho^2 \times 100\). (Where does this come from? From the conditional variance \(\sigma_y^2 (1 - \rho^2)\) in the variance explained section.)
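As a worked check of this rule with the correlation given in the question, \(\rho = 0.5\):

$$
\rho^2 \times 100\% = 0.5^2 \times 100\% = 25\%,
\qquad
\mbox{SD}(Y \mid X = x) = \sigma_y\sqrt{1 - 0.25} \approx 0.866\,\sigma_y
$$

So the correlation itself (0.5) is not the proportion of variance explained; its square is.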
1/1 point (graded)
Suppose the correlation between father and son’s height is 0.5, the standard deviation of fathers’ heights is 2 inches, and the standard deviation of sons’ heights is 3 inches. Given a one inch increase in a father’s height, what is the predicted change in the son’s height?
- 0.333
- 0.5
- 0.667
- 0.75
- 1
- 1.5

Answer: Correct! The slope of the regression line is calculated by multiplying the correlation coefficient by the ratio of the standard deviation of son heights to the standard deviation of father heights. (Note: the course feedback writes this ratio as var_son/var_father, but it means SD_son/SD_father.)
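Plugging the numbers from the question into the slope formula \(m = \rho\,\sigma_y/\sigma_x\), with the son as \(y\) and the father as \(x\):

$$
m = \rho\,\frac{\sigma_{\text{son}}}{\sigma_{\text{father}}} = 0.5 \times \frac{3}{2} = 0.75
$$

A one inch increase in the father's height therefore corresponds to a predicted increase of 0.75 inches in the son's height.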
Comprehension Check due May 29, 2022 00:29 AWST
In the second part of this assessment, you’ll analyze a set of mother and daughter heights, also from GaltonFamilies.
Define female_heights, a set of mother and daughter heights sampled from GaltonFamilies, as follows:
set.seed(1989) # if you are using R 3.5 or earlier
set.seed(1989, sample.kind = "Rounding") # if you are using R 3.6 or later
library(HistData)
data("GaltonFamilies")

female_heights <- GaltonFamilies %>%
  filter(gender == "female") %>%
  group_by(family) %>%
  sample_n(1) %>%
  ungroup() %>%
  select(mother, childHeight) %>%
  rename(daughter = childHeight)
set.seed(1989, sample.kind="Rounding") #if you are using R 3.6 or later
## Warning in set.seed(1989, sample.kind = "Rounding"): non-uniform 'Rounding'
## sampler used
library(HistData)
data("GaltonFamilies")
female_heights <- GaltonFamilies %>%
filter(gender == "female") %>%
group_by(family) %>%
sample_n(1) %>%
ungroup() %>%
select(mother, childHeight) %>%
rename(daughter = childHeight)
mean(female_heights$mother)
## [1] 64.125
sd(female_heights$mother)
## [1] 2.289292
mean(female_heights$daughter)
## [1] 64.28011
sd(female_heights$daughter)
## [1] 2.39416
cor(female_heights$mother, female_heights$daughter)
## [1] 0.3245199
mu_x <- mean(female_heights$mother)
mu_y <- mean(female_heights$daughter)
s_x <- sd(female_heights$mother)
s_y <- sd(female_heights$daughter)
r <- cor(female_heights$mother, female_heights$daughter)
m <- r * s_y/s_x
b <- mu_y - m*mu_x
m
## [1] 0.3393856
b
## [1] 42.51701
m*m # slope squared; note that the proportion of variance explained is r^2, not m^2
## [1] 0.1151826
outcome <- m * 60 + b
outcome
## [1] 62.88015
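The slope and intercept computed by hand above are the least squares estimates, so `lm()` should return the same two numbers. The sketch below checks this on simulated heights; the simulated values and the names `x`, `y`, and `fit` are hypothetical and are not the Galton data.

```r
set.seed(1)

# Simulated mother (x) and daughter (y) heights in inches; made-up parameters
x <- rnorm(200, mean = 64, sd = 2.3)
y <- 42 + 0.34 * x + rnorm(200, mean = 0, sd = 2.2)

# Slope and intercept from the summary statistics, as in the code above
m <- cor(x, y) * sd(y) / sd(x)
b <- mean(y) - m * mean(x)

# The same line fitted by least squares
fit <- lm(y ~ x)

# coef(fit) is c(intercept, slope); compare with c(b, m)
all.equal(unname(coef(fit)), c(b, m))

# Conditional expectation of y at x = 60, computed both ways
m * 60 + b
predict(fit, newdata = data.frame(x = 60))
```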
***Recall what we did in the earlier sections***
mu_x <- mean(galton_heights$father)
mu_y <- mean(galton_heights$son)
s_x <- sd(galton_heights$father)
s_y <- sd(galton_heights$son)
r <- cor(galton_heights$father, galton_heights$son)
m <- r * s_y/s_x # the slope uses the ratio of standard deviations, not variances, so var_son/var_father in the explanation above should read SD_son/SD_father
b <- mu_y - m*mu_x
5/5 points (graded)
Calculate the mean and standard deviation of mothers’ heights, the mean and standard deviation of daughters’ heights, and the correlation coefficient between mother and daughter heights.
Mean of mothers’ heights: 64.125 (correct)
Standard deviation of mothers’ heights: 2.289292 (correct)
Mean of daughters’ heights: 64.28011 (correct)
Standard deviation of daughters’ heights: 2.39416 (correct)
Correlation coefficient: 0.3245199 (correct)
3/3 points (graded)
Calculate the slope and intercept of the regression line predicting daughters’ heights given mothers’ heights. Given an increase in mother’s height by 1 inch, how many inches is the daughter’s height expected to change?
Slope of regression line predicting daughters’ height from mothers’ heights: 0.3393856 (correct)
Intercept of regression line predicting daughters’ height from mothers’ heights: 42.51701 (correct)
Change in daughter’s height in inches given a 1 inch increase in the mother’s height: 0.3393856 (correct)
1/1 point (graded)
What percent of the variability in daughter heights is explained by the mother’s height?
Report your answer as a value between 0 and 100. Do NOT include the percent symbol (%) in your submission.
Answer: 11 (correct)
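The percent of variability explained is the squared correlation times 100. A minimal sketch using the correlation reported above (the variable names are chosen here for illustration):

```r
# Correlation between mother and daughter heights, as reported above
r <- 0.3245199

# Percent of the variance in daughter heights explained by mother heights
pct_explained <- 100 * r^2
round(pct_explained, 1)  # about 10.5, which rounds to the accepted answer of 11
```

Note that this uses the correlation r, not the slope m. The two squares differ whenever s_y and s_x differ, which is why `m*m` in the code above gives 0.115 rather than about 0.105.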
1/1 point (graded)
A mother has a height of 60 inches. Using the regression formula, what is the conditional expected value of her daughter’s height given the mother’s height?
Answer: 62.88015 (correct)
In the Linear Models section, you will learn how to do linear regression.
After completing this section, you will be able to:
Use multivariate regression to adjust for confounders.
Write linear models to describe the relationship between two or more variables.
Calculate the least squares estimates for a regression model using the lm function.
Understand the differences between tibbles and data frames.
Use the do() function to bridge R functions and the tidyverse.
Use the tidy(), glance(), and augment() functions from the broom package.
Apply linear regression to measurement error models.
This section has four parts: Introduction to Linear Models, Least Squares Estimates, Tibbles, do, and broom, and Regression and Baseball. There are comprehension checks at the end of each part, along with an assessment on linear models at the end of the whole section for Verified learners only.
We encourage you to use R to interactively test out your answers and further your own learning. If you get stuck, we encourage you to search the discussion boards for the answer to your issue or ask us for help!
In a previous video, we found that the slope of the regression line
for predicting runs from bases on balls was 0.735. So, does this
mean that if we hire low-salary players with many bases on balls,
increasing our team's walks per game by 2, our team will score
1.47 more runs per game? [][We are again reminded
that association is not causation]. The data does provide strong
evidence that a team with 2 more bases on balls per game than the
average team scores 1.47 more runs per game, but this does not mean that
bases on balls are the cause. If we compute the regression line slope
for singles, we get 0.449, a lower value. Note that a single gets you to
first base just like a base on balls. Those who know a little more
about baseball will tell you that with a single, runners who are on
base have a better chance of scoring than with a base on balls
(Did you see the logical conflict here?).
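The two slopes quoted above can be recomputed from the Lahman `Teams` table. This is a sketch under the assumption that both predictors are per-game rates over the 1961 to 2001 seasons, matching the filters used elsewhere in these notes; the names `dat`, `bb_slope`, and `singles_slope` are introduced here for illustration.

```r
library(Lahman)
library(dplyr)

# Per-game rates for the 1961-2001 seasons
dat <- Teams %>%
  filter(yearID %in% 1961:2001) %>%
  mutate(BB_per_game      = BB / G,
         Singles_per_game = (H - HR - X2B - X3B) / G,
         R_per_game       = R / G)

# Simple regression slopes: runs on walks, and runs on singles
bb_slope <- coef(lm(R_per_game ~ BB_per_game, data = dat))[["BB_per_game"]]
singles_slope <- coef(lm(R_per_game ~ Singles_per_game, data = dat))[["Singles_per_game"]]

c(BB = bb_slope, Singles = singles_slope)

# Predicted difference in runs per game for 2 extra walks per game
2 * bb_slope
```

If the data match the version used in the course, the slopes should come out close to the 0.735 and 0.449 quoted above, and the last line close to 1.47.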
So, how can bases on balls be more predictive of runs? The reason is [][confounding]. Note the correlation between homeruns, bases on balls, and singles. We see that the correlation between bases on balls and homeruns is quite high compared to the other two pairs. [][It turns out that pitchers, afraid of homeruns, will sometimes avoid throwing strikes to homerun hitters]. As a result, homerun hitters tend to have more bases on balls. Thus, a team with many homeruns will also have more bases on balls than average, and as a result, it may appear that bases on balls cause runs. But it is actually the homeruns that cause the runs.
In this case, we say that bases on balls are confounded with homeruns. [][But could it be that bases on balls still help? To find out, we somehow have to adjust for the homerun effect. Regression can help with this].
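To make the idea of adjusting concrete, here is a small simulation (made-up numbers, not baseball data) in which walks have no effect on runs at all, yet a simple regression still gives them a clearly positive slope because both depend on homeruns. Including homeruns as a second predictor in `lm()` is one way to adjust. All names and coefficients below are hypothetical.

```r
set.seed(42)
n <- 1000

# Simulated per-game rates; every coefficient here is made up for illustration
hr   <- rnorm(n, mean = 1, sd = 0.25)        # homeruns per game
bb   <- 3 + 1.0 * hr + rnorm(n, sd = 0.4)    # walks rise with homeruns
runs <- 2.5 + 1.5 * hr + rnorm(n, sd = 0.3)  # runs depend on homeruns only

# Simple regression: walks look predictive of runs
naive_slope <- coef(lm(runs ~ bb))[["bb"]]

# Adjusting for homeruns: the walk coefficient shrinks toward zero
adjusted_slope <- coef(lm(runs ~ bb + hr))[["bb"]]

c(naive = naive_slope, adjusted = adjusted_slope)
```

In real data the adjusted coefficient need not vanish; the point of the sketch is only that the unadjusted slope can be inflated by a confounder.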
library(dplyr)
library(Lahman)
library(ggplot2)
#library(tidyverse)
#library(dslabs)
Teams %>%
filter(yearID %in% 1961:2001) %>%
mutate(Singles = (H - HR - X2B - X3B)/G, BB = BB/G, HR = HR/G) %>%
summarize(cor(BB, HR), cor(Singles, HR), cor(BB, Singles))
## cor(BB, HR) cor(Singles, HR) cor(BB, Singles)
## 1 0.4039313 -0.1737435 -0.05603822
[][Did you see the logical conflict here? The singles-to-runs slope is
smaller than the BB-to-runs slope, yet a single gives runners a better
chance of scoring. Now it gets interesting]